Lindsay Jones - Final Project

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(stargazer)


Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Part 1

Research Question

In the United States, wage stagnation has become a hot-button issue for many people in various fields of employment. Graduate students have been at the center of this issue in recent years- strikes for wage increases and cost-of-living adjustments have taken place at multiple universities throughout the country. Because PhD students often do not have the time to earn extra income (and their contracts often prohibit them from pursuing work elsewhere), how much they will earn from their stipend is a huge factor in considering where to pursue their research (Powell, 2004; Soar et al., 2022). My research question is: Is university ownership status (public vs. private) a predictor of the value of a PhD stipend?

Hypothesis

H₀:University ownership status is not a predictor of the value of a PhD stipend.

H₁: University ownership status is a predictor of the value of a PhD stipend.

Dataset

This dataset is comprised of self-reported survey data collected by PhDStipends.com. Respondents are asked their university, department, academic year, and year in the program. They are also asked whether they receive a 12-month or 9-month salary, gross pay, and required fees. PhDStipends automatically calculates the LW Ratio (living wage ratio), which is the stipend divided by the living wage of the county the university is located in.

In addition to this information, I also manually categorized universities by their ownership status as public or private, and assigned each program to 1 of five broader academic disciplines: Business/Policy, Social Science, Natural Science, Formal Science, and Humanities. Due to a computer issue much of my work was lost, so the dataset is currently incomplete. The analysis that follows is based on the information I was able to recover or reenter within a reasonable period of time.

The variables of interest for me are the ownership status, gross pay, program year, and academic discipline.

library(readr)
csv <- read_csv("~/School/UMASS/DACSS 603/Final Project/csv.csv")

Rows: 11394 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): University, State, Status, Department, Category, Academic Year, Pro...
dbl (6): Pay, LW Ratio, 12 M Gross Pay, 9 M Gross Pay, 3 M Gross Pay, Fees

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(csv)

  University           State              Status           Department       
 Length:11394       Length:11394       Length:11394       Length:11394      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   Category              Pay           LW Ratio     Academic Year     
 Length:11394       Min.   :    1   Min.   :0.000   Length:11394      
 Class :character   1st Qu.:20000   1st Qu.:0.880   Class :character  
 Mode  :character   Median :26000   Median :1.130   Mode  :character  
                    Mean   :25741   Mean   :1.094                     
                    3rd Qu.:31500   3rd Qu.:1.325                     
                    Max.   :95000   Max.   :4.500                     
                    NA's   :46      NA's   :423                       
   ProgYear         12 M Gross Pay   9 M Gross Pay   3 M Gross Pay  
 Length:11394       Min.   :     1   Min.   :   15   Min.   :    4  
 Class :character   1st Qu.: 24000   1st Qu.:16500   1st Qu.: 3000  
 Mode  :character   Median : 29000   Median :20000   Median : 5000  
                    Mean   : 28473   Mean   :20121   Mean   : 5185  
                    3rd Qu.: 33000   3rd Qu.:23850   3rd Qu.: 6128  
                    Max.   :140000   Max.   :90000   Max.   :55816  
                    NA's   :3478     NA's   :7940    NA's   :10228  
      Fees      
 Min.   :    1  
 1st Qu.:  500  
 Median : 1000  
 Mean   : 1967  
 3rd Qu.: 2000  
 Max.   :93725  
 NA's   :6906

head(csv)

# A tibble: 6 × 13
  University  State Status Depar…¹ Categ…²   Pay LW Ra…³ Acade…⁴ ProgY…⁵ 12 M …⁶
  <chr>       <chr> <chr>  <chr>   <chr>   <dbl>   <dbl> <chr>   <chr>     <dbl>
1 Florida In… FL    Public Busine… Busine…     1       0 2016-2… 1st           1
2 Duke Unive… NC    Priva… Art,  … Humani…     4       0 2013-2… 6th an…      NA
3 University… KY    Public Sociol… Social…    15       0 2016-2… <NA>         NA
4 West Virgi… VA    Public Econom… Social…    16       0 2022-2… 1st          23
5 Texas A&am… TX    Public Counse… Social…    20      NA 2017-2… 1st          20
6 Stevens In… NJ    Priva… Comput… Formal…    23       0 2020-2… 1st          NA
# … with 3 more variables: `9 M Gross Pay` <dbl>, `3 M Gross Pay` <dbl>,
#   Fees <dbl>, and abbreviated variable names ¹Department, ²Category,
#   ³`LW Ratio`, ⁴`Academic Year`, ⁵ProgYear, ⁶`12 M Gross Pay`

Part 2

Visualizations

I’ll start with a histogram of all stipends, regardless of university ownership status.

viz <- csv %>% filter(Status %in% c("Public", "Private")) %>% filter(Pay<65000)

ggplot(viz, aes(x=Pay)) + 
  geom_histogram(fill="#15391c", color="#FFFFFF")+
  labs(y = "Count", 
       x = "Annual Gross Pay",
       title = "Stipend Distribution")+
  theme_light()+
  theme(plot.title=element_text(hjust=0.5, size = 20))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution appears somewhat normal, with annual pay most frequently in the range of $20,000 to $30,000 annually.

Next I will generate 2 boxplots: one for public universities, and one for private.

viz %>%
  ggplot(
    aes(x=Status, y=Pay, fill=Status)) +
    geom_boxplot(show.legend = FALSE)+
   labs(y = "Annual Gross Pay (USD)", 
       x = "Ownership Status",
       title = "Graduate Workers' Annual Pay")+
   theme_minimal()+
  theme(axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 11),
          plot.caption = element_text(hjust = 0),
          plot.title=element_text(hjust=0.5, size = 20),
          plot.background = element_rect(fill = "#FFFFFF", colour = "#FFFFFF"),
          panel.background = element_rect(fill = "#FFFFFF"))+
  scale_fill_manual(values = c("#1482a5", "#c8e1c8"))

There are quite a few outliers for both categories, but we can see that median pay is higher in private universities than in public universities. There are also significantly more outliers below the 1st quartile in private universities than in public.

Hypothesis Testing

Creating Models

Explanatory Variable: Ownership Status (Status)
Response Variable: Gross Pay (Pay)
Control Variable: Academic Discipline (Category), Program Year (ProgYear)

First I will run a model for gross pay, using as.factor() to convert ownership status into dummy variables.

fit1=lm(Pay ~ as.factor(Status), data = csv)
summary(fit1)


Call:
lm(formula = Pay ~ as.factor(Status), data = csv)

Residuals:
   Min     1Q Median     3Q    Max 
-30255  -4738    678   4825  66678 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              30259.1      134.1  225.68   <2e-16 ***
as.factor(Status)Public  -6936.8      166.1  -41.76   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8433 on 11346 degrees of freedom
  (46 observations deleted due to missingness)
Multiple R-squared:  0.1332,    Adjusted R-squared:  0.1331 
F-statistic:  1744 on 1 and 11346 DF,  p-value: < 2.2e-16

Based on the p-values, it does seem that ownership status is statistically significant with regards to pay. Now I will plot this model.

Next I will create a model adding the control variable “Category” (academic discipline).

fit2=lm(Pay ~ as.factor(Status) + Category, data = csv)
summary(fit2)


Call:
lm(formula = Pay ~ as.factor(Status) + Category, data = csv)

Residuals:
   Min     1Q Median     3Q    Max 
-32298  -4114    131   4697  69120 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              31844.4      393.0  81.020  < 2e-16 ***
as.factor(Status)Public  -7084.4      161.1 -43.968  < 2e-16 ***
CategoryFormal Science   -1483.1      419.0  -3.540 0.000402 ***
CategoryHumanities       -5071.0      437.9 -11.581  < 2e-16 ***
CategoryNatural Science    543.4      396.1   1.372 0.170125    
CategorySocial Science   -3879.8      413.1  -9.391  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8151 on 11342 degrees of freedom
  (46 observations deleted due to missingness)
Multiple R-squared:  0.1904,    Adjusted R-squared:  0.1901 
F-statistic: 533.6 on 5 and 11342 DF,  p-value: < 2.2e-16

All academic disciplines appear to be statistically significant. The R-squared value here is higher than the previous model.

Next I will create a model adding the control variable “ProgYear” (program year).

fit3=lm(Pay ~ as.factor(Status) + ProgYear, data = csv)
summary(fit3)


Call:
lm(formula = Pay ~ as.factor(Status) + ProgYear, data = csv)

Residuals:
   Min     1Q Median     3Q    Max 
-30705  -4704    428   4855  66262 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              30728.4      155.9 197.106  < 2e-16 ***
as.factor(Status)Public  -6990.4      172.5 -40.514  < 2e-16 ***
ProgYear2nd              -1366.1      245.8  -5.558 2.79e-08 ***
ProgYear3rd               -583.8      272.6  -2.142   0.0323 *  
ProgYear4th               -545.0      301.4  -1.809   0.0706 .  
ProgYear5th               -321.1      335.6  -0.957   0.3386    
ProgYear6th and up        -933.7      467.4  -1.998   0.0458 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8349 on 10266 degrees of freedom
  (1121 observations deleted due to missingness)
Multiple R-squared:  0.1412,    Adjusted R-squared:  0.1407 
F-statistic: 281.4 on 6 and 10266 DF,  p-value: < 2.2e-16

Program year does appear to be statistically significant. R-squared is comparable to the original model.

Finally, I will create a model using both control variables.

fit4=lm(Pay ~ as.factor(Status) + Category + ProgYear, data = csv)
summary(fit4)


Call:
lm(formula = Pay ~ as.factor(Status) + Category + ProgYear, data = csv)

Residuals:
   Min     1Q Median     3Q    Max 
-32083  -4059    200   4627  68628 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              32433.0      413.6  78.411  < 2e-16 ***
as.factor(Status)Public  -7138.9      167.0 -42.758  < 2e-16 ***
CategoryFormal Science   -1529.7      437.2  -3.499 0.000469 ***
CategoryHumanities       -5096.5      458.4 -11.118  < 2e-16 ***
CategoryNatural Science    652.1      415.0   1.571 0.116122    
CategorySocial Science   -3921.7      432.0  -9.078  < 2e-16 ***
ProgYear2nd              -1638.3      237.4  -6.902 5.43e-12 ***
ProgYear3rd              -1051.2      263.8  -3.984 6.82e-05 ***
ProgYear4th               -854.3      291.2  -2.934 0.003354 ** 
ProgYear5th               -746.8      324.4  -2.302 0.021351 *  
ProgYear6th and up       -1372.0      451.6  -3.038 0.002390 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8052 on 10262 degrees of freedom
  (1121 observations deleted due to missingness)
Multiple R-squared:  0.2015,    Adjusted R-squared:  0.2007 
F-statistic:   259 on 10 and 10262 DF,  p-value: < 2.2e-16

In this model, the disciplines of Business/Policy and Formal Science are the only ones which are not statistically significant. Multiple and Adjusted R-squared values are the highest of all the models, so it has the lowest variance. Therefore I believe this model is the best fit. A comparison table:

Diagnostic Plots

par(mfrow= c(2,3)); plot(fit1, which=1:6)

par(mfrow= c(2,3)); plot(fit2, which=1:6)

par(mfrow= c(2,3)); plot(fit3, which=1:6)

par(mfrow= c(2,3)); plot(fit4, which = 1:6)

The large number of categorical variables in my data makes plotting any model challenging, but the fit is not great for any model. The normal Q-Q plots consistently violate assumptions of normality. This is likely due to several outliers, especially on the higher end of the pay scale. Similarly, the Residuals vs. Fitted plots are largely homoskedastic but show many outliers.

Summary

We can see that university ownership status is a statistically significant predictor of the value of a PhD stipend, and that graduate workers at private universities earn more than those at public universities. This may indicate a public-private pay gap in American universities. However, I believe under- and over-reporting stipend values created a lot of variability. Further research should consider pulling from official databases such as university payrolls.

It is ultimately unsurprising to see that graduate workers at private universities are paid more, given the nature of funding differences between private and public universities. It is also unsurprising that the median wage for either type of university is very low. I hope more research can be done with the hope of improving living and working conditions for graduate student workers.

References

Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Living Wage Calculator. (n.d.). Retrieved October 10, 2022, from https://livingwage.mit.edu/

Powell, K. Stipend survival. Nature 428, 102–103 (2004). https://doi.org/10.1038/nj6978-102a

Emily Roberts & Kyle Roberts. (2022, October 10). PhD stipends Dataset. http://www.phdstipends.com/csv

Soar, M., Stewart, L., Nissen, S. et al. Sweat Equity: Student Scholarships in Aotearoa New Zealand’s Universities. NZ J Educ Stud (2022). https://doi.org/10.1007/s40841-022-00244-5